New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add PCIe hotplug support for amdgpu #131
base: roc-5.0.x
Are you sure you want to change the base?
Conversation
xushuotao
commented
Apr 5, 2022
•
edited
edited
- during probing, decrement KFD lock which were incremented when device was removed; otherwise kfd_open is going to fail.
- remove p2p/io links in sysfs when device is hotplugged out.
Thank you for the patch. I haven't reviewed it in detail yet, but it seems like a simple enough change that would be worth including in our upstream driver. We do most of our public code reviews by mailing list on amd-gfx@lists.freedesktop.org for changes to the upstream driver. They should be against the amd-staging-drm-next branch that you can get from https://gitlab.freedesktop.org/agd5f/linux.git. Please run your patch through checkpatch.pl for formal coding style compliance. Thanks! |
I was signed in with the wrong account. I should have replied with this one. |
2428412
to
6e5f7b2
Compare
@fxkamd Thank you very much for your quick review. I have
Thank you very much!
|
Thank you. The p2plink stuff is not upstream yet. So we'll need to deal with that on our DKMS branch after your upstream changes land. The preferred way to send patches to amd-gfx is with "git send-email". |
@fxkamd I have pushed new code that can deal with multi-GPU system per Andrey and Joshi's comment.
|
OK. We realized that Mukul was working on basically the same issue for a different use case. He will clean up his code and post it for review soon. I believe his patch will be more general. Maybe you can give it a try when he posts it, to see if it addresses your use case. |
Currently, the IO-links to the device being removed from topology, are not cleared. As a result, there would be dangling links left in the KFD topology. This patch aims to fix the following: 1. Cleanup all IO links to the device being removed. [note by shuotao]: Also cleaned up p2plinks for backward-compatiblity for rocm-5.0.x; this shall be squashed in the upstream of amd-staging-drm-next. 2. Ensure that node numbering in sysfs and nodes proximity domain values are consistent after the device is removed: a. Adding a device and removing a GPU device are made mutually exclusive. b. The global proximity domain counter is no longer required to be an atomic counter. A normal 32-bit counter can be used instead. 3. Update generation_count to let user-mode know that topology has changed due to device removal. Reviewed-by: Shuotao Xu <shuotaoxu@microsoft.com> CC: Shuotao Xu <shuotaoxu@microsoft.com> Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
With those two patches, the system worked fine on 4-node MI100 GPU machine for hotplug capabilities with tf. This PR now has two overall commits that reflect those two patches. I have also submitted patch to amd-gfx@lists.freedesktop.org. Thank you very much for all the support! Best regards, |
Adding PCIe Hotplug Support for AMDKFD: the support of hot-plug of GPU devices can open doors for many advanced applications in data center in the next few years, such as for GPU resource disaggregation. Current AMDKFD does not support hotplug out b/o the following reasons: 1. During PCIe removal, unbalanced kfd_locked increment during kgd2kfd_suspend. Fixed it by only incrementing the lock if drm_device is not unplugged. 2. Remove redudant p2p/io links in sysfs when device is hotplugged out. 3. New kfd node_id is not properly assigned after a new device is added after a gpu is hotplugged out in a system. libhsakmt will find this anomaly, (i.e. node_from != <dev node id> in iolinks), when taking a topology_snapshot, thus returns fault to the rocm stack. -- This patch fixes issue 1; another patch by Mukul fixes issues 2&3. -- Tested on a 4-GPU MI100 gpu nodes with kernel 5.13.0-kfd/rocm-5.0.2; kernel 5.16.0-kfd is unstable out-of-box for MI100. Signed-off-by: Shuotao Xu <shuotaoxu@microsoft.com>